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o : 

^ , Abstract 

. We consider the problem of regression learning for deterministic design and independent random errors. 

^ ' We start by proving a sharp PAC-Bayesian type bound for the exponentially weighted aggregate (EWA) 
under the expected squared empirical loss. For a broad class of noise distributions the presented bound 

' is valid whenever the temperature parameter p of the EWA is larger than or equal to 4cr-, where cr^ is the 

, noise variance. A remarkable feature of this result is that it is valid even for unbounded regression functions 

,_i ' and the choice of the temperature parameter depends exclusively on the noise level. 

PLh . Next, we apply this general bound to the problem of aggregating the elements of a finite-dimensional 

' linear space spanned by a dictionary of functions (p\,. . We allow M to be much larger than the 

y \ , sample size n but we assume that the true regression function can be well approximated by a sparse linear 

' combination of functions (pj. Under this sparsity scenario, we propose an EWA with a heavy tailed prior 

C/3 , and we show that it satisfies a sparsity oracle inequality with leading constant one. 

Finally, we propose several Langevin Monte-Carlo algorithms to approximately compute such an EWA 

^•f-^ , when the number M of aggregated functions can be large. We discuss in some detail the convergence of 

^ ' these algorithms and present numerical experiments that confirm our theoretical findings. 

m : 

, Keywords: Sparse learning, regression estimation, logistic regression, oracle inequalities, sparsity prior, 

' Langevin Monte-Carlo. 

' 1. Introduction 

o^ : 

■ In recent years a great deal of attention has been devoted to learning in high-dimensional models under 
^ \ the sparsity scenario. This typically assumes that, in addition to the sample, we have a finite dictionary of 

■ '~j ■ very large cardinality such that a small set of its elements provides a nearly complete description of the 

r> ' underlying model. Here, the words "large" and "small" are understood in comparison with the sample size. 

■ Sparse learning methods have been successfully applied in bioinformatics, financial engineering, image 
processing, etc. (see, e.g., the survey in [44]). 

A popular model in this context is linear regression. We observe n pairs (X\, Yi), . . . , (X„, Y„), where 
each Xi - called the predictor - belongs to M*' and F, - called the response - is scalar and satisfies F,- = 
XJAq + with some zero-mean noise The goal is to develop inference on the unknown vector A(, e R*'. 

In many applications of linear regression the dimension of Z,- is much larger than the sample size, i.e., 
M » n. It is well-known that in this case classical procedures, such as the least squares estimator, do not 
work. One of the most compelling ways for dealing with the situation where M » n is to suppose that the 
sparsity assumption is fulfilled, i.e., that Ao has only few coordinates different from 0. This assumption is 
helpful at least for two reasons: The model becomes easier to interpret and the consistent estimation of Aq 
becomes possible if the number of non-zero coordinates is small enough. 

During the last decade several learning methods exploiting the sparsity assumption have been discussed 
in the literature. The ^i-penalized least squares (Lasso) is by far the most studied one and its statistical 
properties are now well understood (cf., e.g., flHSB, 31, 39, 45| and the references cited therein). The 



Lasso is particularly attractive by its low computational cost. For instance, one can use the LARS algo- 
rithm 1. 19.1 . which is quite popular. Other procedures based on closely related ideas include the Elastic Net 
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1471. the Dantzig selector and the least squares with entropy penalization |27|. However, one impor- 
tant limitation of these procedures is that they are provably consistent under rather restrictive assumptions 
on the Gram matrix associated to the predictors, such as the mutual coherence assumption 1 18 1, the uni- 
form uncertainty principle jsl, the irrepresentable [46] or the restricted eigenvalue conditions. This is 
somewhat unsatisfactory, since it is known that, at least in theory, there exist estimators attaining optimal 
accuracy of prediction under almost no assumption on the Gram matrix. This is, in particular, the case for 
the ^o-penalized least squares estimator Thm. 3.1]. However, the computation of this estimator is an 
NP-hard problem. We finally mention the paper l'42'l, which brings to attention the fact that the empirical 
Bayes estimator in Gaussian regression with Gaussian prior can effectively recover the sparsity pattern. 
This method is realized in f42[| via the EM algorithm. However, its theoretical properties are not explored, 
and it is not clear what are the limits of application of the method beyond the considered set of numerical 
examples. 

In Elll] we proposed another approach to learning under the sparsity scenario, which consists in 
using an exponentially weighted aggregate (EWA) with a properly chosen sparsity-favoring prior. There 
exists an extensive literature on EWA. Some recent results focusing on the statistical properties can be found 
in iB Isi [Tli l24[ Eii i43l l . Ap phcation of EWA to the single-index regression and Gaussian graphical models 
has been developed in 12011 and 112 ill , respectively. Procedures with exponential weighting received much 
attention in the literature on the on-hne learning, see [12, 22, 40] , the monograph [14;] and the references 
cited therein. 

The main message 

of ([lira, is that the EWA with a properly chosen prior is able to deal with the 
sparsity issue. In particular, 

Mia 

prove that such an EWA satisfies a sparsity oracle inequality (SOI), 
which is more powerful than the best known SOI for other common procedures of sparse recovery. An 
important point is that almost no assumption on the Gram matrix is required. In the present work we 
extend this analysis in two directions. First, we prove a sharp PAC-Bayesian bound for a large class of 
noise distributions, which is valid for the temperature parameter depending only on the noise distribution. 
We impose no restriction on the values of the regression function. This result is presented in Section|2] The 
consequences in the context of linear regression under sparsity assumption are discussed in Section[3] 

The second problem that we analyze here is the computation of EWA with the sparsity prior Since 
we want to deal with large dimensions M, computation of integrals over R'^ in the definition of this esti- 
mator can be a hard problem. Therefore, we suggest an approximation based on Langevin Monte-Carlo 
(LMC). This is described in detail in Section|4] Section|5]contains numerical experiments that confirm fast 
convergence properties of the LMC and demonstrate a nice performance of the resulting estimators. 



2. PAC-Bayesian type oracle inequality 

Throughout this section, as well as in Section [3] we assume that we are given the data (Z,, F,), / - 
generated by the non-parametric regression model 

F,=/(Z,)+^i, /=!,...,«, (1) 

with deterministic design Zi, . . . ,Z„ and random errors We use the vector notation Y = / + f , where 
^ - (^1, . . . ,^«)^ and the function /(■) is identified with the vector / = (/(Zi), . . . ,/(Z„))^. The space 
Z containing the design points Z, can be arbitrary and / is a mapping from X to R. For each function 
/z : 2^ — > R, we denote by \\h\\„ the empirical norm (i h{Zi)^y'^. Along with these notation, we 
will denote by |[v|[/, the ^^-norm of a vector v = (vi, . . . , v„) e R", that is \\v\\p - Yj'U K'/I''. I ^ p < 00, 
||v|[oo - max/ |v,| and |[v|[o is the number of nonzero entries of v. With this notation, WfWj = 

Assume that we are given a collection {f^ : A e A) of functions : 2 — » R that will serve as building 
blocks for the learning procedure. The set A is assumed to be equipped with a cr-algebra and the mappings 
A i-> are assumed to be measurable with respect to this cr-algebra for all z & Z- Let tt be a probability 
measure on A, called the prior, and let j6 be a positive real number, called the temperature parameter. We 
define the EWA by 

fniz)^ f fA(z)n„4dA), 
Ja 
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where n„fi is the (posterior) probabihty distribution 

n„,p{dA) exp{ -p-'\\Y-f,\\l]n(dA\ 

and = (/i(Zi), . . .,f;[{Z„)y. We denote by L the smallest positive number, which may be equal to +00, 
such that 

(A,A')eA' ^ max|/i(Z,.)-/,,(Z,)| «;L (2) 

In the sequel, we use the convention = and, for any function v : M — > M, we denote by ||v||co its 
Loo(M)-norm. 

In order to get meaningful statistical results on the accuracy of the EWA, some conditions on the noise 
are imposed. In addition to the standard assumptions that the noise vector ^ = (^i, ■ • ■ ,^«)^ has zero 
mean and independent identically distributed (iid) coordinates, we require the following assumption on the 
distribution of ^\ . 

Assumption N. For any y > small enough, there exist a probability space and two random variables ^ 
and ^ defined on this probability space such that 

i) ^ has the same distribution as the regression errors 

ii) ^ + C has the same distribution as (1 + y)^ and the conditional expectation satisfies E[^ | ^] =0, 

iii) there exist fo e (0, 00] and a bounded Borel function v : R — » R+ such that 

— logE[g'qg^a] 

hm sup z < 1, 

y^o (f,fl)e[-/„,fo]xsupp(a t^yvia) 

where supp(^) is the support of the distribution of ^. 

Many symmetric distributions used in applications satisfy Assumption N with functions v such that ||v||oo 
is a multiple of the variance of the noise ^j. This follows from Remarks[T]|6]given at the end of this section 
and their combinations. 

Theorem 1. Let Assumption N be satisfied with some fiinction v and let (|2) hold. Then for any priori, any 
probability measure p on A and any j3 ^ max(4||v||a3, 2L/fo) we have 

nwK-fwl]^ r \\f-fA\\lp(dA) + ^^^^^, 

Ja n 
where 'K(- , ■ ) stands for the Kullback-Leibler divergence. 

Prior to presenting the proof, let us note that Theorem [T] is in the spirit of [16, Theorems 1,2], but is 
better in several aspects. First, the main assumption ensuring the validity of the oracle inequality involves 
the distribution of the noise alone, while 1 16, Theorem 2] relies on an assumption (denoted by (C) in lfl6ll ) 
that ties together the distributional properties of the noise and the nature of the dictionary { f^}. A second 
advantage is that Assumption N is independent of the sample size n and, consequently, suggests a choice 
of the parameter p that does not change with the sample size. Theorem 1 of 1 16] also has these advantages 
but it is valid only for a very restricted class of noise distributions, essentially for the Gaussian and uniform 
noise. As we shall see later in this section, Theorem[T]leads to a choice of the tuning parameter /3, which is 
very simple and guarantees the validity of a strong oracle inequality for a large class of noise distributions. 

Proof of Theorem^] It suffices to prove the theorem for p such that \\f^ - f\\l p(dA) < 00 and p <s: tt 
(implying 'K{p, k) < 00), since otherwise the result is trivial. 

We first assume thatyS > 4||v||oo and that L < 00. Let 7 > Obe a small number. Let now (^1, ^i), . . . , (^„,^„) 
be a sequence of iid pairs of random variables defined on a common probability space such that (^, , sat- 
isfy conditions i)-iii) of Assumption N for any /. The existence of these random variables is ensured by 
Assumption N. We use here the same notation ^, as in model ([T]i, since it causes no ambiguity. 

Set /i^ =/,-/,/,=/„-/, ^ = (^i,...,^„)^ U(h,h') = \\h\\l + 2ffh' and AU(h,h',h") = 
(\\h\\l - Wh'Wl) + 2(h - h'Yh" for any pair h, h' , h" e W. With this notation we have 

min - ft] = nml] = Efii7^i|2 + —Va 

L ny ^ 
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Therefore, E[\\f„ - f\\l] = 5 + 5 1, where 

5 1 = — E[ log exp( ) nn4dA)\. 

We first bound the term S . To this end, note that 

nnRKdA) - -p. n{dA) 

J^exp{-/3-'Uih,.,^)}nidw) 



and, therefore. 



By part ii) of Assumption N and the independence of vectors (^,, for different values of /, the probability 
distribution of the vector + ^)/( 1 + y) coincides with that of ^. Therefore, + f )/( 1 + y) may be replaced 
by ^ inside the second expectation. Now, using the Holder inequality, we get 

S < ^Eflog f e-''^y»""^''-^MdA)\. 

Next, by a convex duaUty argument ifiol p. 160], we find 

S ^ f \\h,\\lp(dA) + ' 

J A 



„2 „.^n , mp,^) 

n(l + y) 



Let us now bound the term S i . According to part iii) of Assumption N, there exists yo > such that 

Vy ^ yo, 

logEfe'^lf = a] 
sup ^ , i ^ y(fl)(l + o^(l)), V fl e M. 

In what follows we assume that y ^ yo- Since for every /, |2j6"'(/zi(Zi) - /i(Zi))| ^ 2/3^^L ^ fo, using 
Jensen's inequality we get 

Si < ^E[log j^expj - j(\\h£„ - ||I||2)}0,E(exp{|]2/?-'(/2,(Z,) -I(Z,))4|f)7r(«'^)] 

^ ^E[log j^exp{ - ^(\\hA\l - ml)}e, exp{ ^"'J^i'"^ ||/i, -I||2(l + o^(l))};r(flfi)l. 



For y small enough (y ^ yo), this entails that up to a positive multiplicative constant, the term S i is 
bounded by the expression E[ log exp ( - '■^^^^-^y^)0AJT{dA)], where 

Using 1 15, Lemma 3] and Jensen's inequality we obtain 5 1 ^ for any y ^ (y3 - 4||v'||oo)/4nL. Thus, we 
proved that 



nmil]^ \ \\h,f„p(dA) + ^^^^^ 

n(i + y) 



Ja 



for any y ^ yo A (/? - 4||v||oo)/4nL. Letting y tend to zero, we obtain 



nwhwi]^ I \\hA\lp(dA) + ^^^^^^ 
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for any yS > max(4||v||oo, 2L/f()). Fatou's lemma allows us to extend this inequality to the case fi — 
max(4||v|U,2L/fo). 

To cover the case L - +00, Jq = +00, we fix some Lq e (0, 00) and apply the obtained inequality to the 
truncated prior K^'idA) oc lf^^,(A)n(dA), where L' e (Lo,oo) and A/,- - {A e A : max,- |/i(Z,)| < L'}. We 
obtain that for any measure p n supported by A^.^, 



Ja 



E[ini„] \\h,r„p(dA) + ' 



^ I mipidA).^^^. 



I' 



One easily checks that tends a.s. to h and that the random variable sup^,^^ \\h^ ||^l(max; |^,| ^ C) is 
integrable for any fixed C. Therefore, by Lebesgue's dominated convergence theorem we get 



E[||/!||;;]l(max|^,.|s=:C)] < I \\h,\\lp{dA) + 1^^^^^. 



Letting C tend to infinity and using Lebesgue's monotone convergence theorem we obtain the desired 
inequality for any probability measure p which is absolutely continuous w.r.t. n and is supported by A^ 
for some Lq > 0. If piAj^) < 1 for any Lq > 0, one can replace p by its truncated version p^' and use 
Lebesgue's monotone convergence theorem to get the desired result. □ 

The following remarks provide examples of noise distributions, for which Assumption N is satisfied. 
Proofs of these remarks are given in the Appendix. 

Remark 1 (Gaussian noise). If^i is drawn according to the Gaussian distribution N{Q,cr^), then for any 
y > one can choose ( independently of^ according to the Gaussian distribution 7V(0, (2j + y^)cr^). This 
results in v(a) = cP' and, as a consequence, Theorem\J\holds for any /3 ^ 4cr^. Note that this reduces to 
the Leung and Barron 's H^l result if the prior n is discrete. 

Remark 2 (Rademacher noise). If^\ is drawn according to the Rademacher distribution, i.e. P(^i — ±cr) — 
1/2, then for any y > one can define ^ as follows: 

^ = (1 + y)crsgn[cr-'^ -(l+y)U]-^, 

where U is distributed uniformly in [—1,1] and is independent of ^. This results in via) = cr^ and, as a 
consequence, Theorem\J}holds for any /3 ^ 4cr^ = 4E[^j ]. 

Remark 3 (Stability by convolution). Assume that and ^'^ are two independent random variables. If £,\ 
and satisfy Assumption N with to — 00 and with functions v(a) and v'(a), then any linear combination 
a^i + ff'^'j satisfies Assumption N with to = 00 and the v-function a^via) + (a'^^Via). 

Remark 4 (Uniform distribution). The claim of preceding remark can be generalized to linear combi- 
nations of a countable set of random variables, provided that the series converges in the mean squared 
sense. In particular, if is drawn according to the symmetric uniform distribution with variance cr^, then 
Assumption N is fulfilled with to — 00 and v{d) = cr^. This can be proved using the fact that has the 
same distribution as crYa^i 2"''7i, where rji are iid Rademacher random variables. Thus, in this case the 
inequality ofTheorem\J]is true for any fi ^ Acr^. 

Remark 5 (Laplace noise). If^\ is drawn according to the Laplace distribution with variance cr^, then for 
any 7 > one can choose ^ independently of^ according to the distribution associated to the characteristic 
function 

^ ~ (1 +r)^' "^1 + (1 + y)\a-t)^ /2^' 

One can observe that the distribution of ( is a mixture of the Dirac distribution at zero and the Laplace 
distribution with variance (1 + y)^(T^. This results in v(a) = 2<t^/(2 — cr^t^ and, as a consequence, by 
taking to — l/o'^, we get that Theorem\J]holds for any p ^ max(8cr^, 2La). 
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Remark 6 (Bounded symmetric noise). Assume that the errors ^, are symmetric and that P{\^i\ ^ B) - I 
for some B e (0, oo). Let U ~ U{[—1, 1]) be a random variable independent of ^. Then, ^ = (1 + 
y)!^! sgn[sgn(^) - (1 + - ^ satisfies Assumption N with v{a) - a^. Since ||v||oo ^ B^, we obtain that 
Theorem\I\is valid for any f3 ^ 4B^. 

Consider now the case of finite A. W.l.o.g. we suppose that A = {1, . . . , M}, [f^, A e A} - {fi, . . . 
and we take the uniform prior n{A - j) = I /M. From Theorem[T]we immediately get the following sharp 
oracle inequality for model selection type aggregation. 

Corollary 1. Let Assumption N be satisfied with some function v and let (O hold. Then for the uniform 
prior n{A - j) - I /M, j - 1 , . . . , M, and any j3 ^ max(4||v||oo, 2L/ to) we have 

ml -ft] ^ min + 

j=i,...,M n 

This corollary can be compared with bounds for combining procedures in the theory of prediction of 
deterministic sequences |41, 23, 12, 2^ T2,[l3l- With our notation, the bounds proved in these works can 



be written is the form 

1 " 1 " 

- YiYi - f*(Zi)f ^ Ci , min - V (K,- - //Z,))^ + ' — - . (3) 
n ^—^ j=i,...,M n -4— ' 



. CzlogM 



Here .//(Z,) is interpreted as the value of F, predicted by the jth procedure, f*{Zi) as an aggregated forecast, 
and Ci ^ 1, C2 > are constants. Such inequalities are proved under the assumption that F,'s are 
deterministic and uniformly bounded. When Ci = 1, applying (O to random uniformly bounded F,'s 
from model ([U with E(^,) - and taking expectations can yield an oracle inequality similar to that of 
Corollary [1] However, the uniform boundedness of y,'s supposes that not only the noise but also the 
functions / and fj are uniformly bounded. Bounds on / should be a priori known for the construction of the 
aggregated rule /* in (|3]l but in practice they are not always available. Our results are free of this drawback 
because they hold with no assumption on /. We have no assumption on the dictionary {/i , . . . , /m) neither. 



3. Sparsity prior and SOI 

In this section we introduce the sparsity prior and present a sparsity oracle inequality (SOI) derived 
from Theorem[T] 

In what follows we assume that A c K*' for some positive integer M. We will use boldface letters to 
denote vectors and, in particular, the elements of A. For any square matrix A, let Tr(A) denote the trace 
(sum of diagonal entries) of A. Furthermore, we focus on the particular case where 'Fa is the image of 
a convex polytope in M*^ by a link function g : M. ^ M.. More specifically, we assume that, for some 
R e (0, +00] and for a finite number of measurable functions {^yj^-^i j^, 

Ta = IfAz) = g( 2 Vz e Z| i € satisfies ^ r\, 

where \\A\\\ - 2; I'^/l stands for the ^i-norm. The link function g is assumed twice continuously difFeren- 
tiable and known. Typical examples of link function include the linear function g{x) - x, the exponential 
function g{x) - e ", the logistic function g{x) - e'^ He " + 1), the cumulative distribution function of the 
standard Gaussian distribution, and so on. 

If, in addition, / e T^, then model ^ reduces to that of single-index regression with known link 
function. In the particular case of g{x) - x, this leads to the linear regression defined in the Introduction. 
Indeed, it suflices to take 

Z, = (0i(Z,),...,0m(Z,))^, /=!,...,«. 

This notation will be used in the rest of the paper along with the assumption that Z,- are normalized so that 
all the diagonal entries of matrix i ^i^J ^"e equal to one. 



6 



□ .1 OB 0.06 04 0.02 -0.02 -0 04 -0.06 -0 OB -0 1 

Figure 1: The boxplots of a sample of size lO'* drawn from the scaled Gaussian, Laplace and Student ;(3) distributions. In all the 
three cases the location parameter is and the scale parameter is 10"^. 




The family 'Tti defined above satisfies inequality (O with L - 2/?||^'||ooL0, where - max, j |0j(Z,)| 
and ll^'IU is the maximum of the derivative of g on the interval [-TJL^, TJL^]. Indeed, since A is the l\ ball of 
radius R in and 4>fi are bounded by L^, the real numbers m,- - A^Xj and u'. - A'^Z, belong to the interval 

\-RL^,RL^] for every A and A! from A. Consequently, IXi(Z,) - /^'(Z,)| = \g(ui) - g{u'.)\ - g'{s)ds is 
bounded by ||g'||oo|Mi - m'I, the latter being smaller than 2R\\g'\\coL^. 

We allow M to be large, possibly much larger than the sample size n. If M » «, we have in mind that 
the sparsity assumption holds, i.e., there exists A* G R'*^ such that / in ([T]i is close to /i* for some A* having 
only a small number of non-zero entries. We handle this situation via a suitable choice of prior n. Namely, 
we use a modification of the sparsity prior proposed in [15] . It should be emphasized right away that we 
wiU take advantage of sparsity for the purpose of prediction and not for data compression. In fact, even if 
the underlying model is sparse, we do not claim that our estimator is sparse as well, but we claim that it is 
quite accurate under very mild assumptions. On the other hand, some numerical experiments demonstrate 
the sparsity of our estimator and the fact that it recovers correctly the true sparsity pattern in examples 
where the (restrictive) assumptions mentioned in the Introduction are satisfied (cf. Section |5]l. However, 
our theoretical results do not deal with this property. 

To specify the sparsity prior n we need the Huber function w : M — > M defined by 

if \t\ < 1 
1, otherwise. 

This function behaves very much like the absolute value of f, but has the advantage of being difi'erentiable 
at every point f e R. Let r and a be positive numbers. We define the sparsity prior 

n{dX) = n KIWIi < R) dA, (4) 

where Ca,T,R is the normalizing constant. 

Since the sparsity prior (HJl looks somewhat complicated, an heuristical explanation is in order Let us 
assume that R is large and a is small so that the functions e""(H) and ]l(||/l||i ^ R) are approximately equal 
to one. With this in mind, we can notice that n is close to the distribution of V2tY, where Y is a random 
vector having iid coordinates drawn from Student's t-distribution with three degrees of freedom. In the 
examples below we choose a very small t, smaller than 1/n. Therefore, most of the coordinates of tY are 
very close to zero. On the other hand, since Student's t-distribution has heavy tails, a few coordinates of 
tY are quite far from zero. 

These heuristics are illustrated by Figure[T]presenting the boxplots of one realization of a random vector 
in with iid coordinates drawn from the scaled Gaussian, Laplace (double exponential) and Student 

t{3) distributions. The scaling factor is such that the probability densities of the simulated distributions are 
equal to 100 at the origin. The boxplot which is most likely to represent a sparse vector corresponds to 
Student's f(3) distribution. 

The relevance of heavy tailed priors for dealing with sparsity has been emphasized by several authors 



(see 0371 Section 2.1] and references therein). However, most of this work focused on logarithmically 



concave priors, such as the multivariate Laplace distribution. Also in wavelet estimation on classes of 
"sparse" functions 12311 and ll33ll invoke quasi-Cauchy and Pareto priors. Bayes estimators with heavy- 
tailed priors in sparse Gaussian shift models are discussed in [JJ. 

The next theorem provides a SOI for the EWA with the sparsity prior (|4|. 

Theorem 2. Let Assumption N be satisfied with some function v and let (|2| hold. Take the prior n defined 
in d?]) and p ^ max(4||v||oo, 2L/f()). Assume that R > 2Mt and a ^ 1/(4Mt). Then for all A* such that 
\\A*\\x sS, R - 2Mt we have 

E[||^ - ftA ^ Wfr - /ll^ ^ Z log (1 + ?) + ^ 

M 

with Cgj - 1 ifg(x) - X and Cgj - Wg'Wl, + H^'lUdlglU + ll/IU)/or other link functions g. 
Proof. Let us define the probability measure po by 

^(^) ^ (^(^ - ^B^ilMrM - ^*)- (5) 

Since \\A*\\\ ^ R - 2Mt, the condition A - A* e B\{2Mt) implies that A e B\{R) and, therefore, po is 
absolutely continuous w.r.t. the sparsity prior n. In view of Thm.[Tl we have 

mln-fwl]^ r \\f,-f\ipo{dA) + ^^^^. 

Since fAZi) = g(X]A) we have V,[(/,(Z,) - /(Z,))^] = 2g' {X] A){fAZd - f{Zd)Xi and 

V5[(/i(Z,) - f(Zdf] = 2[g'{X'jAf + g"(X]A){g(X]A) - /(Z,)))Z,Z^ 

One can remark that the factor of XiX] in the last display is bounded by Cgj. Therefore, in view of the 
Taylor formula, 

(/i(Z,) - /(Z,))2 ^ (X,.(Z,) - f{Zi)f + 2(fr(Zd - f(Z!))g'(XjA*)Xj(A - A*) 
+Cgj[Xj{A-A*)]\ 

By the symmetry of pa with respect to A*, the integral - A*)pQ{dA) vanishes. Combining this with the 
fact that the diagonal entries of the matrix ^ 2, XiX'J are equal to one, we obtain 

r WfA - ft PoidA) ^ Wfr - ft + C,j f \\A - A'Wl po(dA). 

J A Jk" 

To complete the proof, we use the following technical result. 
Lemma 3. For every integer M larger than 1, we have: 

M 



/ 

Jr 



(^1 - A\fpQ{dA) ^ AT^e^''"^ -Kipo, n) ^ 2(a||i*||i + 1) + 4 V log(l + \A]\/t). 



The proof of this lemma is postponed to the appendix. It is obvious that inequality ^ follows from 
Lemma 3, since ^„ \\A - A'W^ po(dA) - M - A*}^ poidA) and, under the assumptions of the theorem, 

Theorem|2]can be used to choose the tuning parameters T,a,R when M » n. The idea is to choose 
them such that both terms in the second line of (|5]l were of the order (9(1 /n). This can be achieved, 
for example, by taking r- ~ (M«)"' and R - 0(Mt). Then the term ^ H'jLi log(l + |/1*|/t) becomes 
dominating. It is important that the number M* of nonzero summands in this term is equal to the number 
of nonzero coordinates of A*. Therefore, for sparse vectors A*, this term is rather small, namely of the order 
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M* (log M)/n, which is the same rate as achieved by other methods of sparse recovery, cf. SI^ItIHI- An 
important difference compared with these and other papers on -based sparse recovery is that in Theorem 
|2] we have no assumption on the dictionary {<pi,. . . , cPm}- 

Note that in the case of logistic regression the Hnk function g, as well as its first two derivatives, are 
bounded by one. Therefore, since the logistic model is mainly used for estimating functions / with values 
in [0, 1], Theorem |2] holds in this case with Cgj ^ 3. Similarly, for the probit model [i.e., when the link 
function g is the cdf of the standard Gaussian distribution) and / with values in [0, 1], one easily checks 
thatC,,,/ (tt^' + l)/2. 



4. Computation of the EW-aggregate by the Langevin Monte-Carlo 

In this section we suggest Langevin Monte-Carlo (LMC) procedures to approximately compute the 
EWA with the sparsity prior when M » n. 

4.1. Langevin Diffusion in continuous time 

We start by describing a continuous-time Markov process, called the Langevin diffusion, that will play 
the key role in this section. Let V : K*' — > M be a smooth function, which in what follows will be referred 
to as potential. We will assume that the gradient of V is locally Lipschitz and is at most of linear growth. 
This ensures that the stochastic differential equation (SDE) 

dL, = vy(L,) dt + y/ldW,, Lo = ^, f ^ (6) 

has a unique strong solution, called the Langevin diffusion. In the last display, W stands for an M- 
dimensional Brownian motion and Aq is an arbitrary deterministic vector from M*^. It is well known that 
the process {L,),j,o is a homogeneous Markov process and a semimartingale, cf. [36, Thm. 12.1]. 

As a Markov process, L may be transient, null recurrent or positively recurrent. The latter case, which 
is the most important for us, corresponds to the process satisfying the law of large numbers and implies the 
existence of a stationary distribution. In other terms, if L is positively recurrent, there exists a probability 
distribution Py on M"^ such that the process L is stationary provided that the initial condition Aq is drawn 
at random according Py- A remarkable property of the Langevin diffusion — making it very attractive for 
computing high-dimensional integrals — is that its stationary distribution, if exists, has the density 

Pv(A) oc e^f-", A e R*', 

w.rt. the Lebesgue measure J^sl Thm. 10.1]. Furthermore, some simple conditions on the potential V 
ensure the positive recurrence of L. The following proposition gives an example of such a condition. 

Proposition 1 (|34], Thm 2.1). Assume that the function V is bounded from above. If there is a twice 
continuously differentiable function D : M.^ — ^ [1, °°) and three positive constants a, b and r such that 

VV(AfVD(A) + AD{A) ^ -aD{A) + bKMh s$ r), (7) 

for every A G M.^, then the Langevin diffusion L defined by ^ is D-geometrically ergodic, that is 

|e[/!(L,)|Lo = ^] - r h{A)py{A)dA\^RvD{Afi)p'y 

for every function h satisfying ||/i/D||oo ^ 1 and for some constants Ry > and pv & (0, 1). 

Function D satisfying ^ is often referred to as Lyapunov function and condition O is called the drift 



condition towards the set {A : \\A\\2 ^ r). Recall that the drift condition ensures geometrical mixing [132 , 
Theorem 16.1.5]. Specifically, for every function h such that ||/z^/D||oo ^ 1 and for every t,s > Q, 

\Co\ A„[h{L,),h(L,)]\ ^ RvD{Ao)pI''1 
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Combining this with the result of Proposition [T] it is not hard to check that if p^/D||oo ^ 1, then 



(i r h{L,)dt- { h{A)pv(X)dAf 
Jo Jr" 



J, (8) 



where C is some positive constant depending only on V. Note also that, in view of Proposition [T] the 
squared bias term in the bias-variance decomposition of the left hand side of ([8]) is of order 0(T^^). Thus, 
the main error term comes from the stochastic part. 

4.2. Langevin diffusion associated to EWA 

In what follows, we focus on the particular case g{x) = x. Given (Z,-, F,), i = with Z, e M*^ 

and Yi e M, we want to compute the expression 

.1 exp { - ||Y - XA\\l]7t(dA) 

A — — , (y) 

j^,exp{-f3-mY-XA\\l}7:(dA) 

where X - (Xi, . . ., X„y . In what follows, we deal with the prior 



assuming that R - +oo. As proved in Sections |2] and [3] this choice of the prior leads to sharp oracle 
inequalities for a large class of noise distributions. An equivalent form for writing ^ is 



f 



ApviA) dA, where pviA) oc e^*-*) 
with 

l|Y 

V(A) = - — 



V{A) = _ ^ {21og(T2 + a]) + cjiaAj)}. (10) 



A simple algebra shows that D{A) = e^'H-'H^ satisfies the drift condition (|7]l. A nice property of this 
Lyapunov function is the inequality ||/l||^ ^ a^^D{A). It guarantees that ([8]) is satisfied for the functions 
h(A) = Ai. 

Let us define the Langevin diffusion L, as solution of ^ with the potential V given in ( fTOl l and the 
initial condition Lq - 0. In what follows we will consider only this particular diffusion process. We define 
the average value 



1 

= - L, dt, 
i Jo 



T ^0. 



According to (O this average value converges as T — > oo to the vector A that we want to compute. Clearly, 
it is much easier to compute Lj- than A. Indeed, A involves integrals in M dimensions, whereas Lj- is a one- 
dimensional integral over a finite interval. Of course, to compute such an integral one needs to discretize 
the Langevin diffusion. This is done in the next subsection. 

4.3. Discretization 

Since the sample paths of a diffusion process are Holder continuous, it is easy to show that the Riemann 
sum approximation 



Lj^ -Y,LT,(Ti^,-Ti), 

/=() 



with Q — Ti) < T\ < . . . < Tm - T converges to Lj in mean square when the sampling is sufficiently dense, 
that is when max, |r,+i - r,| is small. However, when simulating the diffusion sample path in practice, it 
is impossible to follow exactly the dynamics determined by We need to discretize the SDE in order to 
approximate the solution. 
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A natural discretization for the SDE (|6]) is proposed by the Euler scheme with a constant step of dis- 
cretization h > 0, defined as 



Lf^i = Lf + hVV{L^) + yflh^,, Ll = 0, (1 1) 

for A; = 0, 1, . . . , [r//!]-l, where^i, • ■ • i.i.d. standard Gaussian random vectors in M*' and [x\ stands 
for the integer part of x € M. Obviously, the sequence {L^; k ^ 0] defines a discrete-time Markov process. 
Furthermore, one can show that this Markov process can be extrapolated to a continuous-time diffusion- 
type process which converges in distribution to the Langevin diffusion as /; — > 0. Here extrapolation means 
the construction of a process {Lf,/,; f e [0, T]} satisfying L^/, /, - for every k - 0, . . . , [T/h]. Such a 
process {L/,/,; f e [0, T]} can be defined as a solution of the SDE 

[r/Ai-i 

This amounts to connecting the successive values of the Markov chain by independent Brownian bridges. 
The Girsanov formula implies that the Kullback-Leibler divergence between the distribution of the process 
{L,; t e [0, T}] and the distribution of {L,,f ; t e [0, T}] tends to zero as h tends to zero. Therefore, it makes 
sense to approximate Lj by 

Proposition 2. Consider the linear model Y — X/l* + ^, where X is the nx M deterministic matrix and ^ is 
a zero-mean noise with finite covariance matrix. Then for X - X Pvi.^ dA with pviA) oc e^'^'*' and ViA) 
defined in ( 17 OH we have 

Proof. We present here a high-level overview of the proof deferring the details to the Appendix. 
Step 1 We start by showing that 



limE 



i,h dt 

Jo 



-E 1 f ~ 

Step 2 We then split the expression j L,,/, dt into two terms: 

^ f Lt^hdt^l- f L,,hH),A]{\\L,.h\\2)dt+l- f L,^hhA,+c<,]{\\Lt,h\\2)dt. (12) 
^ Jo / Jo / Jo 

Ti ll 

and show that the expected norm EIIT2II2 is bounded uniformly in h and T by some function of A that 
decreases to as A —> 00. Later A will be chosen as an increasing function of T . 

Step 3 We check that the Kullback-Leibler divergence between the distribution of (L/,a; ^ f ^ T) and 
of (L,; ^ f ^ r) tends to zero as /z — > 0. This implies the convergence in total variation and, as a 
consequence, we get 



limE 



r G{L,j,)dt- r G{A)pv{A)dA\\^'R\(\: f G{L,)dt- f G{A)pv{A)dA\ 

Jo Jk" ' i LV7 Jo JjjM / 



(13) 



for any bounded measurable function G : M*^ — > M. We use this result with G(A) - /l, l[o,A](imi2)5 
i = l,...,M. 

Step 4 To conclude the proof we use the fact that J^^^^^ Apv(A) dA tends to zero as A — > 00, and that by 
the ergodic theorem (cf. Proposition 1) the right hand side of (T3[ tends to as T — > 00. 

□ 
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This discretization algorithm is easily implementable and, for small values of h, Lj is very close to 
the integral X - ^ X pv{X) dX of interest. However, for some values of h, which may eventually be small 
but not enough, the Markov process {L^; ^ 0) is transient. Therefore, if h is not small enough the sum in 

- E J L 

the definition of Lji^ explodes |35|] . To circumvent this problem, one can either modify the Markov chain 
by incorporating a Metropolis-Hastings correction, or take a smaller h and restart the computations. 
The Metropolis-Hastings approach guarantees the convergence to the desired distribution. However, it 
considerably slows down the algorithm because of a significant probability of rejection at each step of 
discretization. The second approach, where we just take a smaller h, also slows down the algorithm but we 
keep some control on its time of execution. 



5. Implementation and experimental results 

In this section we give more details on the implementation of the LMC for computing the EW-aggregate 
in the linear regression model. 

5.1. Implementation 

The input of the algorithm we are going to describe is the triplet (Y, X, cr) and the tuning parameters 
{a,p, T, h, T), where 

- Y is the n-vector of values of the response variable, 

- X is the nx M matrix of predictor variables, 

- cr is the noise level, 

- y6 is the temperature parameter of the EW-aggregate, 

- a and t are the parameters of the sparsity prior, 

- h and T are the parameters of the LMC algorithm. 

The output of the proposed algorithm is a vector X e R*' such that, for every x e R*', x^X provides a 
prediction for the unobservable value of the response variable corresponding to x. The pseudo-code of the 
algorithm is given below. 



Input: Observations (Y, X, cr) and parameters (a,p, t, h, T) 


Output: The vector A 


Set 




[n,M]=size(X) ; 




L=zeros (M, 1) ; 




lambda=zeros(M, 1) ; 




H=0; 


Calculate 




XX=X'*X; 




Xy=X'*y; 


while H is less than T do 




nablaV= (2//3) * (Xy-XX*L) -a*&' (orL) ; 




nablaV=nablaV-4*L./(T~2+L.-2) ; 




L=L+h*nablaV+sqrt(2*h)*randn(M, 1) ; 




H=H+h; 




lainbda=lambda+h*L/T ; 


end 


return lambda 



Algorithm 1: The algorithm for computing the EW-aggregate by LMC. 



Choice of T: Since the convergence rate of Lj to X is of the order T ' and the best rate of convergence 
an estimator can achieve is n"'''^, it is natural to set T = n. This choice of T has the advantage 
of being simple for implementation, but it has the drawback of being not scale invariant. A better 
strategy for choosing T is to continue the procedure until the convergence is observed. 
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Figure 2: A typical result of the EWA (left panel) and the Lasso (right panel) in the setup of Example 1 with n = 200, M = 500 and 
S = 20. 



Choice of h: We choose the step of discretization in the form: h = p/{Mn) - yS/Tr(X^X). More details on 
the choice of h and T will be given in a future work. 

Choice of /?, t and a: In our simulations we use the parameter values 

a = 0, yS = 4cr2, r = 4cr/(Tr(X^X))'/2 . 

These values of /3 and t are derived from the theory developed above. However, we take here a = 
and not a > as suggested in Section [3] We introduced there a > Q for theoretical convenience, in 
order to guarantee the geometric mixing of the Langevin diffusion. Numerous simulations show that 
mixing properties of the Langevin diffusion are preserved with o- = as well. 

5.2. Numerical experiments 

We present below two examples of application of the EWA with LMC for simulated data sets. In 
both examples we give also the results obtained by the Lasso procedure (rather as a benchmark, than for 
comparing the two procedures). The main goal of this section is to illustrate the predictive ability of the 
EWA and to show that it can be easily computed for relatively large dimensions of the problem. In all 
examples the Lasso estimators are computed with the theoretically justified value of the regularaization 
parameter cr V8 log M/n (cf. 11). 

5.2.1. Example 1 

This is a standard numerical example where the Lasso and Dantzig selector are known to behave well 
(cf. H). Consider the model Y = X/l* + cr^, where X is a M x n matrix with independent entries, such that 
each entry is a Rademacher random variable. Such matrices are particularly well suited for applications 
in compressed sensing. The noise ^ € M" is a vector of independent standard Gaussian random variables. 
The vector A* is chosen to be S -sparse, where S is much smaller than M. W. 1. o. g. we consider vectors A* 
such that only first S coordinates are different from 0; more precisely. A* - !(] ^ S). Following |9], we 
choose cr^ - S 19. We run our procedure for several values of S and M. The results of 500 replications are 
summarized in Table 1. We see that EWA outperforms Lasso in all the considered cases. 

A typical scatterplot of estimated coefficients for M - 500, n - 200 and S = 20 is presented in Fig.|2] 
The left panel shows the estimated coefficients obtained by EWA, while the right panel shows the estimated 
coefficients obtained by Lasso. One can clearly see that the estimated values provided by EWA are much 
more accurate than those provided by Lasso. 

An interesting observation is that the EWA selects the set of nonzero coordinates of A* even better than 
the Lasso does. In fact, the approximate sparsity of the EWA is not very surprising, since in the noise-free 
linear models with orthogonal matrix X, the symmetry of the prior implies that the EWA recovers the zero 
coordinates without error. 
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M - 100 




M - 200 




M = 500 










Lasso 




Lasso 




Lasso 


n - 


100 5 


= 5 


0.063 


0.344 


0.064 


0.385 


0.087 


0.453 








(0.039) 


(0.132) 


(0.043) 


(0.151) 


(0.054) 


(0.161) 


n — 


100 5 


= 10 


0.73725 


1.680 


1.153 


1.918 


1.891 


2.413 








(0.699) 


(0.621) 


(1.091) 


(0.677) 


(1.522) 


(0.843) 


n — 


100 5 


= 15 


5.021 


4.330 


6.495 


5.366 


8.917 


7.1828 








(1.593) 


(1.262) 


(1.794) 


(1.643) 


(2.186) 


(2.069) 


n = 


:200 5 


= 5 


0.021 


0.151 


0.022 


0.171 


0.019 


0.202 








(0.011) 


(0.048) 


(0.013) 


(0.055) 


(0.012) 


(0.057) 


n — 


200 5 


= 10 


0.106 


0.658 


0.108 


0.753 


0.117 


0.887 








(0.047) 


(0.169) 


(0.048) 


(0.198) 


(0.051) 


(0.239) 


n — 


200 5 


= 20 


1.119 


3.124 


1.6015 


3.734 


2.728 


4.502 








(0.696) 


(0.806) 


(1.098) 


(0.907) 


(1.791) 


(1.063) 



Table 1: Average loss \\A - of the estimators obtained by the EW-aggregate and the Lasso in Example 1. The standard deviation 
is given in parentheses. 

We note that the numerical results on the Lasso in Table 15.2.11 are substantially different from those 
reported in the short version of this paper published in the Proceeding of COLT 2009 [17]. This is because 
in IitIi we used the R packages lars and glmnet, whereas here we use the MATLAB package ll_ls. It 
turns out that in the present example the latter provides more accurate approximation of the Lasso than the 
aforementioned R packages. 

The running times of our algorithm are reasonable. For instance, in the case n = m = 100 and 5 = 10 
the execution of our algorithm is only three times longer than the 11-ls implementation of the Lasso. On 
the other hand, the prediction error of our algorithm is more than twice smaller than that of the Lasso. 

5.2.2. Example 2 

Consider model ^ where Z, are independent random variables uniformly distributed in the unit square 
[0, 1]^ and are iid NiO, cr^) random variables. For an integer A; > 0, we consider the indicator functions 
of rectangles with sides parallel to the axes and having as left-bottom vertex the origin and as right-top 
vertex a point of the form (i/k, j/k), (i, j) e N^. Formally, we define 0j by 

</>(i-i)k+j(x) = h),i]xio.j](kx), Vx e [0, 1]^. 

The underlying image / we are trying to recover is taken as a superposition of a small number of rectangles 
of this form, that is f{x) - Yjf^i X'(4'tix)^ for all x e [0, 1]^ with some A* having a small /"o-norm. We set 
k - 15, ||/l*||o = 3, /Ijg - /IjQQ - /Ijoo - 1- Thus, the cardinaUty of the dictionary is M = = 225. 

In this example the functions i^j are strongly correlated and therefore the assumptions like restricted 
isometry or low coherence are not fulfilled. Nevertheless, the Lasso succeeds in providing an accurate 
prediction (cf. Table 2). Furthermore, the Lasso with the theoretically justified choice of the regularization 
parameter cr -y/8 log M/« is not much worse than the ideal Lasso-Gauss (LG) estimator. We call the LG 
estimator the ordinary least squares estimator in the reduced model where only the predictor variables 
selected at a preliminary Lasso step are kept. Of course, the performance of the LG procedure depends on 
the initial choice of the tuning parameter for the Lasso step. In our simulations, we use its ideal (oracle) 
value minimizing the prediction error and, therefore, we call the resulting procedure the ideal LG estimator. 

As expected, the EWA has a smaller predictive risk than the Lasso estimator However, a surprising 
outcome of this experiment is the supremacy of the EWA over the ideal LG in the case of large noise 
variance. Of course, the LG procedure is faster However, even from this point of view the EWA is rather 
attractive, since it takes less than two seconds to compute it in the present example. 
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tWA Lasso Ideal LG 


cr = l,n = 100 


0.160 0.273 0.128 
(0.035) (0.195) (0.053) 


0- = 2, n = 100 


0.210 0.759 0.330 
(0.072) (0.562) (0.145) 


o- = 4, n = 100 


0.420 2.323 0.938 
(0.222) (1.257) (0.631) 


cr= l,n = 200 


0.130 0.187 0.069 
(0.030) (0.124) (0.031) 


cr = 2, n = 200 


0.187 0.661 0.203 
(0.048) (0.503) (0.086) 


o- = 4, n = 200 


0.278 2.230 0.571 
(0.132) (1.137) (0.324) 



Table 2: Average loss J^^^ ( Hji^j ~ ^p0y(^))^ dx of the the EWA, the Lasso and the ideal LG procedures in Example 2. The 
standard deviation is given in parentheses. 




Figure 3: This figure shows a typical outcome in the setup of example 2 when n = 200 and k = 15. Left: the original image. Center: 
the observed noisy sample with cr = 0.5. Pixels for which no observation is available are in black. Right: the image estimated by the 
EWA. 

6. Conclusion and outlook 



This paper contains two contributions: New oracle inequalities for EWA, and the LMC method for 
approximate computation of the EWA. The first oracle inequality presented in this work is in the line of the 
PAC-Bayesian bounds initiated by McAllester 130] . It is valid for any prior distribution and gives a bound 
on the risk of the EWA with an arbitrary family of functions. Next, we derive another inequality, which 
is adapted to the sparsity scenario and called the sparsity oracle inequality (SOI). In order to obtain it, we 
propose a prior distribution favoring sparse representations. The resulting EWA is shown to behave almost 
as well as the best possible linear combination within a residual term proportional to M* (log M)/n, where 
M is the true dimension, M* is the number of atoms entering in the best linear combination and n is the 
sample size. A remarkable fact is that this inequality is obtained under no condition on the relationship 
between different atoms. 

Sparsity oracle inequalities similar to that of Theorem |2] are valid for the penalized empirical risk 
minimizers (ERM) with a /"o-penalty (proportional to the number of atoms involved in the representation). 
It is also well known that the problem of computing the /"o-penalized ERM is NP-hard. In contrast with 
this, we have shown that the numerical evaluation of the suggested EWA is a computationally tractable 
problem. We demonstrated that it can be efficiently solved by the LMC algorithm. Numerous simulations 
we did (some of which are included in this work) confirm our theoretical findings and, furthermore, suggest 
that the EWA is able to efficiently select the sparsity pattern. Theoretical justification of this fact, as well 
as more thorough investigation of the choice of parameters involved in the LMC algorithm, are interesting 
topics for future research. 
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Appendix: proofs of technical results 

6.1. Proof of Proposition^ 

For brevity, in this proof we denote by || ■ || the EucHdean norm in R*' and we set a = 1 in ( fTOl l. 
The case of general a > is treated analogously. Recall that for some small /; > we have defined the 
M-dimensional Markov chain (Lf ; = 0, 1, 2, . . .) by (cf. ( fTOl l and (fTTll): 

Lf^i = Lf + 2/7/J-'X^(Y - XLf) - hg{Ll) + ^ = 0, 

where (^^,; k - 1, 2, . . .) is a sequence of iid standard Gaussian vectors in R*', and 

In what follows, we will use the fact that the function g is bounded and satisfies g{A) ^ for all A € R"^. 

Let us prove some auxiliary results. Set v - 2y6"'X^Y, A = 2yS"'X^X and assume that h ^ 1/||A||. 
Without loss of generality we also assume that T//; is an integer In what follows, we denote by C > a 
constant whose value is not essential, does not depend neither on T nor on h, and may vary from Une to 
line. Since the function g is bounded and ^^^^^ has zero mean, we have 

E[Lf^J = (/ - /;A)E[Lf ] + /iE[v - ^(Lf )], Mk ^ 0. 

Therefore, 

l|E[Lf^i]|| 11(7 - /;A)E[Lf ]|| + Ch^ ||E[Lf ]|| + Ch, Sk ^ 0. 
By induction, we get 

||E[Lf ]|| Ckh CT, Mk e [0, [Tlh]]. (14) 
Furthermore, since ^j^^y is independent of and Y, we have 

E[||Lf^i ||2] = E[||Lf +hv- hkLl - hg{Ll)\\^] + 2hM 

^ E[||Lf ||2 + IhiLlYiy - kLl) - 2h(LlYg{Ll) + h^\\v - khl - g(Lf )||2] + IhM 

^ E[||Lf ||2 + IhiLlYiv - khl) + lh\kLl\f + lh\v - g(Lf )||2] + 2hM 

^ E[||Lf ||2 + IKLlYv - IhiLlfik - hk^)Ll + 2h^\\v - g{^)\\'] + 2hM 

^ E[\\Lff]+2hE[(Lfy]v + Ch 

^ E[\\Lff]+ChT, \fke[0,[T/h]]. 

Once again, using induction, we get 

E[||Lf ||2] <; CkhT CT^, Mk e [0, [Tlh]]. (15) 

This implies, in particular, that (/!/r)E[||L^^yjj|p] — > as /i — > for any fixed T. 



Proof of Step 1. Denote by tp the function 

\p{A) ^v-kA- giA), MA e R^, 
and define the continuous-time random process (L(,a; ^ f ^ [T/h]h) by 

lT/h]-l 

dL,j,= ^ il/(L^)llkh,(k^i^h)(t)dt+ ^/2dW„ Lo,/, = 0, (16) 

k=0 

where W, is a M-dimensional Brownian motion satisfying Wkh = for ^- The rigorous construction 
of W can be done as follows. Let (B,; ^ t ^ T) he a M-dimensional Brownian motion defined on the 
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same probability space as the sequence (^^; ^ A: ^ [T/h]) and independent of (^^; ^ A: ^ [T/h])- One 
can check that the process defined by 

W, = + B, - Bkh -{j- k){B(k+i)h - Bkh - ^t+i), t e [kh, (k + l)h[ 



is a Brownian motion and satisfies Wth - ^k- 
By the Cauchy-Schwarz inequahty, 



E 



[Tlh]-l 



1 /"J-i 1 1^1 



dt 



= E 



k=0 

[T/h]-l „{k+\)h 



< - y E[||L,,,-L,,,,||2]aff 

lT/h]-i 

- 2 I E[/22||^(Lf)||2 + 2||W,-Wtt||2]flf? 



2h 



3 



— 2 E[||^(Lf)||2]+4M/;. 



*=() 



Using the inequahty \\il/{A)\\ C(l + \\A\\) and ([TSll, we get 

lT/h]-l 



E 



< Ch{\+hT^). 



This completes the proof of the Step 1 . 



Proof of Step 2. Using (fTSI l we obtain 



mm] ^ i r E[iiL,„iiw,,](iiL,,„ii)] -i- r 

^ Jo ^ A Jo 

„ [r//i]-l M,k+\)h 

— y (E[||Lf||2]+/z2E[||^(L^) + ALf||2]+E[||W,-W,A||2])rff 

^ iri Jkh 



C 
TA 



k=0 
[T/h-l] 



€ — J] /2(E[||Lf||2] + C/z2+M/z) < 



k=Q 



(17) 



Thus, choosing, for example, A = T-' we guarantee that limj-^oo lim;,^o E[||T2||] - 0. 

Proof of Step 3. First, note that ( fT6b can be written in the form 

dLrj, = >ff(Lh, f) dt + V2t/W,, Lo,/, = 0, 

where iJ/(Li„t) is a non-anticipative process that equals if/iLkhji) when f e [kh,(k + l)h). Recall that the 
Langevin diff'usion is defined by the stochastic differential equation 

dL, = i//(L,) dt + ^dW,, Lo = 0. 

Therefore, the probability distributions P/^j- and P/,, 7 induced by, respectively, (L,;0 ^ t ^ T) and 
(L,_/,; ^ f ^ r) are mutually absolutely continuous and the corresponding Radon-Nykodim derivatives 
are given by Girsanov formula: 



dPrr 



(L) = exp {-^ ^ (4^(L, t) - ip(L,)Y{dL, - ijjiL,) dt)-\^^ UiL, t) - i)f{L,)f dt]. 
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This implies that the Kullback-Leibler divergence between P^^^ j and Vlj is given by 

^(PL.r|PL,„r) = -E[log (^^(^))] = 4 E[||iA(L, f) - ip{L,)\\^] dt. 

Using the expressions of ^ and 0-, as well as the fact that the function \p is Lipschitz continuous, we can 
bound the divergence above as follows: 



[Tlh]-l ^(k+\)h 

k=o 

lT/h]-l „(k+l)h pt 2 

= C Yj \ m\ilf(L^ds+yl2{W,-Wkh)\]dt. 

^_A Jkh Jkh 



[T/h]-l Mk+\)h 

z 

k=0 

From the Cauchy-Schwarz inequality and the fact that =^ C(\ + we obtain 

[Tlh]-\ p(k+\)h 



^\\^|J{L,)\\ ] dsdt + ChT 

kh Jkh 



[r//!]-l „[k+\)h 

Ch^ Y I E[||iA(L,)||2] t/i + C/zr 

< Ch^ r E[||(/r(L,)||2] + C/zr 
Jo 

^ Ch^ r E[||L,||2] t/s + C/ir. 
Jo 



;o 

l|2 ;„ 



Since by Proposition [T| the expectation of ||Lj|p is bounded uniformly in s, we get '7C(PL,r|Pi,, 7-) ^ as 
/i — > 0. In view of Pinsker's inequality, cf, e.g., |i38^, this implies that the distribution P^^ j converges to 
Vij in total variation as h — > 0. Thus, (fTSl l follows. 



Proof of Step 4. To prove that the right hand side of (fT3] l tends to zero as T — > +00, we use the fact that the 
process L, has the geometrical mixing property with D{A) - e'^H-'H^. Bias-variance decomposition yields: 



E 



The second term on the right hand side of the last display tends to zero as T — > 00 in view of Proposition[Tl 
while the first term can be evaluated as follows: 

i^Varfjl G(L,)dt] = X <^»^o[G(L,),G(L,)]«'f af^ 

^ ^ r r pv^"'^dtdsi^cT-K 

i' JQ Jo 



This completes the proof of Proposition |2l 
6.2. Proof of Lemma\3\ 

We first prove a simple auxiliary result, cf. Lemma |4] below. Then, the two claims of Lemma [3] are 
proved in Lemmas |5] and |6] respectively. 
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Lemma 4. For every M e N and every s > M, the following inequality holds: 

M 



1 r 1— r dUj 



Proof. Let U\, . . ., Um be iid random variables drawn from the scaled Student f(3) distribution having as 
we have 

M J M 



density the function M i-> 2/[;7r(l +m^)^]. One easily checks that E[t/j] = 1. Furthermore, with this notation, 



\ d ' 



In view of Chebyshev's inequality the last probability can be bounded as follows; 



^ ME[Ul] ^ M 



.^^ (5-ME[|t/i|])2 ^ (i-M)2 

and the desired inequality follows. □ 

Lemma 5. Let the assumptions of Theorem\2\be satisfied and let po be the probability measure defined by 
^.IfM^l then 

^{A,-A\fpa{dA)i:AT^e^''"\ 
Proof. Using the change of variables u - (A - A*)/t we write 

(Ai - A\fpo{dA) = Cmt^ u\i n (1 + M2)-2e--(«™,)) du 

J A Jbaim) y\ 



7=1 



with 



Cm = ( (11(1 + M2)-2e--(™"i)) j„) (18) 



where Uj are the components of u. Bounding the functions e "<™"j) by one, extending the integration from 
BiilM) to R*' and using the inequality ^ u\{\ + u\y^du\ ^ n, we get 

r {A, - A\fpo{dA) CmM {(\+ t^r^dtf'' = ICmAtt/I)'', 
J A Jm 

where we used that the primitive of the function (1 + x^y^ is i arctan(x) + Jij+K^- ^'^ bound Cm we first 
use the inequality ^ 2\x\ which yields: 



^ Jb,(2M) )J (1 + Mp^/ V Jg_(2M) )J (1 + up^' 

In view of ( fT9] l and Lemma[3]we have 

Cm s; e'*""'^(2/;r)'^(l - 2/''^'^(2/7r)'^ (20) 

for M ^ 2. Combining these estimates we get 

r (ii - A\fpo(dA) ^ 4r2/-^ 
Ja 

and the desired inequality follows. □ 
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Lemma 6. Let the assumptions of Theorem\2\be satisfied and let po be the probability measure defined by 
0. Then 

M 

'K{pQ,7:) ^ 2{a\\A% + 221og(l + \A]\/t)) + (1 + AMar). 
Proof. The definition of n, po and of the Kullback-Leibler divergence imply that 

= log(CMQ,,«) + 2 log 1^^^}^°^"''*^ 

+ 2^1 L.iiMr) ('^(H) - - '^p))Po(dA). (21) 

We now successively evaluate the three terms on the RHS of HTl . First, in view of Q, we have 

Jb,(«) (1 + m])^ Mm ^ ' 

This and (l20| imply \og{CMCa,T,R) =^ 1 + 4MQfT. 

To evaluate the second term on the RHS of ( 1211 1 we use that 

r' + i^ _ 2T(Aj - A*) ^ Af 

T^ + (Aj-Ap^ - l + ^2 + ^._^p2('^;/^)+^2 + (^._^p2 

1 + \A]/t\ + (Aj/rf ^(1+ \A]/T\f. 

This entails that the second term on the RHS of ( 1211 1 is bounded from above by Yjf=i 21og(l + \A*\/t). 
Finally, since the derivative of w(-) is bounded in absolute value by 2, we have (l>(aAj) - a>{a(Aj - A*)) ^ 
2a|/l*| which implies: 




(LdiaAj) - (b{a{Aj - Aj)))poidA) < 2a||^*||i. 



Combining these inequalities we get the lemma. □ 

6.3. Proofs of remarks\l]i6\ 

We only prove Remarks |2] and |6l since the proofs of the remaining remarks are straghtforward. 

6.3.1. Proof of Remark^ 

Let ^ be a random variable satisfying P(^ = ±cr) = 1/2 and let U be another random variable, inde- 
pendent of ^ and drawn from the uniform distribution on [ - 1 , 1 ] . Recall that ^ = ( 1 + y)cr sgn[cr" ' ^ - ( 1 + 

y)m - ^. 

We start by proving that ^ + ^ has the same distribution as (1 + y)^. Clearly, |^ + f| equals (1 + y)o- 
almost surely. Furthermore, 

P(^ + ^ = (1 + y)(T) = n^r-'^ ^ (1 + y)U) 

= i(p(l ^ (1 + y)U) + P( - 1 > (1 + yW)) 




This entails that P(^ + C - ~(1 + y)o") = 1/2 and, therefore, the distributions of f + f and (1 + y)^ coincide. 
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We compute now the conditional expectation E[^|^]. Since U and ^ are independent, we have 

E[^ I ^ = cr] = (1 + r)(rE( sgn[l - (1 + y)U]) - cr = 0. 

Similarly, E[^ | ^ = -cr] = 0. 

To complete the proof of Remark|2l it remains to show that part iii) of Assumption N is fulfilled. Indeed, 



flycr^ fiycr^ ^\ 2{\ + y) 2{\ + y) 



1 



tya- + log 1 1 + _ i I L 



Applying the inequality of Lemma 3] with oq =2(1 +y)ly and x - tycr, we get 

logE[g'qg = cr] 1 ,2(l+r) , _^ 
3 — 2 ^ — = 1+7 

and the desired result follows. 

6.3.2. Proof of Remark^ 

We start by computing the conditional moment generating function (Laplace transform) of ^ given ^: 

E[e'^|^ = fl]= e-"'E[e'<^+'^>|^ = fl] 

= e-'«Je'(i+r)Wp(sgn(fl) > (1 + y)f/) + e-'('+*lp( sgn(fl) < (1 + 

= e-'«(e'(i+r)«ll21 + e-'(i+y)«^_l (22) 
\ 2 + 2y 2 + 27/ 

Using ( [22] ) we obtain 

L ' J 2 + 27 2 + 27 

since the symmetry of ^ implies that E[e"'"^'''^] = E[e'*'^'>'-'f] for every t. Thus, ( + ^ has the same 
distribution as (1 + y)^. 

On the other hand, taking the derivatives of both sides of (l22T l and using the fact that E[^ \^ - a] equals 
to the derivative of the moment generating function E[e'^ | ^ = a] at f = 0, we obtain that E[^ | ^ = a] = 
for every a e [-B, B]. To complete the proof of Remark|6] we apply U6i, Lemma 3] to the right hand side 
of (l22T i. This yields 

log [E[e'( I ^ = «]) < (tjaf^ ^ (tBfyd + 7)- 
Therefore, part iii) of Assumption N is satisfied with v{a) ^ B^. This completes the proof of Remark|6] 
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